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ABSTRACT 

The field of regulatory genomics today is charac- 
terized by the generation of high-throughput data 
sets that capture genome-wide transcription factor 
(TF) binding, histone modifications, or DNAsel 
hypersensitive regions across many cell types and 
conditions. In this context, a critical question is how 
to make optimal use of these publicly available 
datasets when studying transcriptional regulation. 
Here, we address this question in Drosophila 
melanogaster for which a large number of 
high-throughput regulatory datasets are available. 
We developed i-cisTarget (where the V stands for 
integrative), for the first time enabling the discovery 
of different types of enriched 'regulatory features' in 
a set of co-regulated sequences in one analysis, 
being either TF motifs or 'in vivo' chromatin 
features, or combinations thereof. We have 
validated our approach on 15 co-expressed gene 
sets, 21 ChIP data sets, 628 curated gene sets and 
multiple individual case studies, and show that 
meaningful regulatory features can be confidently 
discovered; that bona fide enhancers can be 
identified, both by in vivo events and by TF motifs; 
and that combinations of in vivo events and 
TF motifs further increase the performance of 
enhancer prediction. 

INTRODUCTION 

Understanding the principles of transcriptional regulation 
remains one of the greatest challenges in functional 
genomics, despite years of intensive investigations. 



Spectacular advances in experimental technologies, such 
as ChlP-seq (1), FAlRE-seq (2) and RNA-seq (3) 
represent obvious breakthroughs in this field, as they 
allow interrogating regulatory activity at the genome-wide 
scale, and are becoming available to most research groups 
(4,5). However, interpretation of these genome-wide 
datasets, as well as their integration into a unified model 
of cz'.v-regulation that includes computational motif predic- 
tions remains challenging for many biologists, given the 
amount of information and the lack of appropriate tools. 
Two typical situations are often encountered in genomics 
studies. First, given a set of co-expressed genes, an imme- 
diate question is whether these genes share regulatory 
motifs and, if so, which transcription factors (TFs) may 
co-regulate these genes, or a significant subset thereof. 
Secondly, given a set of genomic loci identified through 
DNase-Seq, FAIRE-seq, or ChlP-Seq, motif discovery 
can be applied in a similar fashion as to co-expressed 
gene sets, with the aim to confirm the presence of the 
targeted TF (for ChlP-seq against TFs), uncover novel 
co-factors, but also disentangle the noisy input set into 
direct target regions of different TFs. Many tools have 
been developed in recent years to predict enriched motifs 
in a set of co-expressed genes [e.g. using proximal 
promoter sequences such as Clover (6), oPOSSUM (7), 
PASTAA (8) and PSCAN (9); or to discover de novo 
motifs, such as oligo-analysis (10), MEME (11) and 
MotifSampler (12)]. With the increased use of ChlP-Seq, 
several of these methods have been adjusted to also 
analyse ChIP peak datasets (e.g. oPOSSUM), and 
several new methods have appeared, such as peak-motifs 
(13) and MEME-ChIP (14). 

However, with the increasing amount of genome-wide 
data being generated, another question could be whether 
some previously identified events (such as DNA binding 
or histone modifications through ChIP, or DNAse 
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hypersensitive sites, collectively referred to as l in vivo 
events' or iVE in the rest of this article), possibly in a 
different biological context, are found enriched in our 
set of genes or genomic loci. For example, one could ask 
whether genes found to be highly expressed during a 
differentiation process through RNA-seq are enriched in 
ChlP-seq peaks for a particular TF, identified in an 
independent study; or, whether open chromatin regions 
isolated using FAIRE-seq are enriched in particular 
histone modifications, identified in ChlP-seq experiments. 
In addition, besides individual enriched motifs or iVEs, it 
often appears that a particular combination of these 
features drives the activity of regulatory regions, as was 
shown in many recent studies (15-17). Hence, while 
enrichment in a particular motif indicates that a region 
is potentially targeted by a TF, its combination with a 
particular histone modification indicates that the regions 
are actually active in the condition investigated. Given 
their coverage in terms of conditions investigated, 
large-scale consortia such as ENCODE (18), modEncode 
(4), or the Berkeley Drosophila Transcriptional Network 
Project (BDTNP) (19) provide a unique opportunity to 
address these questions. Hundreds of datasets, 
investigating histone modifications, chromatin states, 
chromatin-binding protein and TF-binding events for 
many cell-lines or developmental stages make it more 
and more likely that an independent dataset will 
show an enrichment for some features present in these 
large-scale repositories. 

Here we present a novel method, called i-cisTarget 
(integrative c is Tar get ), with the aim to tackle these two 
challenges: (i) identifying enriched regulatory features, 
as motifs or iVEs, in a set of co-expressed genes or 
related genomic loci; (ii) using these features to predict 
ra-regulatory modules (CRMs), either around the set of 
genes provided, or among the genomic loci submitted, 
and infer regulatory networks. We have implemented 
i-cisTarget for Drosophila, as a proof-of-concept, but 
also because Drosophila is one of the mostly used 
multi-cellular model organisms to study transcriptional 
regulation and the cw-regulatory code during develop- 
ment, given its relatively compact genome, the genetic 
tools to manipulate gene regulatory networks, and its suit- 
ability for in vivo enhancer validations. Indeed, many of 
the CRM prediction methods based on TF motif cluster- 
ing and TF motif conservation have been originally 
developed and validated in Drosophila, such as Cluster- 
Buster (20), SWAN (21), Stubb (22), StubbMS (23), 
Ahab (24), cisDecoder (25) and e-cisAnalyst (26). In 
i-cisTarget we use an approach based on whole-genome 
rankings, combined with recovery statistics (27). This 
approach has been proven to be very powerful for motif 
discovery, both in Drosophila (28,29) and in the human 
genome (30,31). Here we modify this methodology to 
calculate enrichment for iVEs, motif features, motif 
combinations, iVE combinations and mixed motif/iVE 
combinations (hence 'regulatory features' in general). 
Importantly, i-cisTarget allows for the analysis of both 
genomic loci (e.g. ChIP peak datasets) and co-expressed 
gene sets (e.g. from microarrays and RNA-Seq). 



Methods that incorporate iVEs for CRM prediction 
have been recently developed, such as CENTIPEDE 
(32), CHROMIA (33), PriorsEditor (34) or 
CRMDecoder (35) but these methods focus on genome- 
wide CRM prediction using user-selected 'tracks' with 
iVEs. CRMDecoder does extract enriched iVEs from 
a training set of CRMs, before applying these to 
whole-genome scoring (i.e. supervised CRM discovery), 
but this method is not designed to work on gene sets, 
nor does it use TF motifs to predict TF target CRMs 
and regulatory networks. 

Given that CRM prediction is greatly aided by iVEs 
(36,37), a key challenge is to develop methods that 
identify the most informative iVEs using the results from 
high-throughput experiments as input. In addition to 
feature discovery, an important challenge is to use these 
features in the next step to identify direct target CRMs of 
TFs and to map gene regulatory networks. In this article, 
we present a new method, called i-cisTarget, to tackle this 
challenge. We first describe the methodology used in 
i-cisTarget, and then show the results of several large-scale 
validation experiments and its application to large bench- 
mark datasets that we assembled for this study, including 
ChIP data and co-expressed gene sets. We assess the 
performance of i-cisTarget in terms of motif and iVE 
enrichment, the prediction of CRMs, and the prediction 
of target genes. Finally, we compare the features and 
the performance of i-cisTarget to other available 
computational methods for as-regulatory sequence 
analysis. 

MATERIALS AND METHODS 

Partitioning the non-coding genome 

Our analyses rely on a partition of the non-coding regions 
of the genome into non-overlapping regions. This 
partitioning is based on the PhastCons score (38) 
(Supplementary Figure SI, black track). We use the 
PhastCons WIG files indicating, for each nucleotide of 
the genome, its PhastCons score between 0 and 1. This 
score is averaged over a sliding window of 100 bp, and 
regions of at least 100 bp with an averaged PhastCons 
score >0.1 are extracted. This yields a set of disjoint 
regions (Supplementary Figure SI, orange track), with 
gaps between them. Suppose we have two regions ri and 
rj, and we call Gij the gap between them. We now extend 
the regions ri and rj on both ends, up to the midpoint of 
Gij. The extended regions are now called Ri and Rj. The 
full set of {Ri} represents a partitioning of the whole 
genome (Supplementary Figure SI, light orange track). 
Since we are interested in regulatory features outside 
coding regions, we now subtract all coding regions from 
our set {Ri}; this subtracting is done at the nucleotide 
level, i.e. certain regions will be shortened, while others 
fully included in coding regions will be fully removed 
(Supplementary Figure SI, yellow track). Additionally, 
regions containing a binding site for a class I insulator 
are split at the binding site (Supplementary Figure SI, 
green track). The exon subtraction and insulator splitting 
might produce small regions at the 5'/3' edges of genes or 
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near insulator-binding sites; in order to avoid the presence 
of these small regions, we scan all resulting regions and 
merge regions < 500 bp with surrounding regions until the 
resulting merged region is >500bp. Note that we merge 
regions with the smallest of its surrounding regions first. 
The complete procedure yields 136 353 regions (referred to 
as 136K regions) of average size 790 bp (median 751 bp) 
that cover the entire non-coding part of the Drosophila 
melanogaster genome (Supplementary Figure SI, red 
track). Using all 900 CRMs and reporter constructs in 
the RedFly 3.0 database, we determined that on average, 
85% of their sequence is covered by a single region of our 
136K partition. This proportion reaches 90% if we restrict 
this analysis to the 103 CRMs referenced as 'minimal 
CRMs' in the database, i.e. those that have been experi- 
mentally shown to be of minimal size. In addition, the 
majority of these minimal CRMs (all except three) do 
not show insulator sites. Thus, the genomic regions 
correspond well to CRMs and candidate CRMs. 

Scoring with position weight matrices 

The entire 136K region collection is scored for the 
presence of (conserved) homotypic clusters of binding 
sites modelled by a position weight matrix (PWM) as 
described previously (28). Briefly, Cluster-Buster (20) is 
used for the detection of these motif clusters. Cluster dis- 
covery is performed for the complete library of PWMs, 
resulting in a score-based ranking of the 136K regions for 
each motif in the collection. To improve the specificity of 
the predicted motif clusters, sequence conservation is also 
taken into account. Orthologous regions in 11 related 
Drosophila species are scored for the presence of these 
clusters, namely D. simulans, D. yakuba, D. virilis, 
D. erecta, D. pseudoobscura, D. persimilis, D. ananassae, 
D. sechellia, D. grimshawi, D. mojavensis and D. wittistoni. 
To find the corresponding regions in these species, the 
UCSC liftOver procedure from the Kent tools suite (39) 
is used. LiftOver utilizes a pre-computed genome location 
transformation between different genome assemblies and 
species, using the chain files obtained from pair-wise 
whole genome alignments (40). We allow for non-unique 
mapping of a reference region to a related genome, taking 
only the region with the highest motif score into account 
for the final integration. 

Combining the rankings across species via order statis- 
tics (OS) culminates in a single ranking for each motif in 
the collection. OS is a probabilistic method to evaluate the 
probability ((/-value) of observing a particular configur- 
ation of ranks across the different related species by 
chance (41,42). This results in a g-value for each region 
based on the species specific ranks and thus effectively 
integrates all comparative genomics information in a 
single ranking for each PWM in our library, thereby 
allowing for motif movement within each region. 

Scoring with iVEs 

The definition of regions bound by TFs or marked by 
histone modifications from ChIP data depends on the 
peak calling algorithm used; many have been defined, 
but their level of agreement depend on the type of data 



considered, and the validity of the control model used. 
In order to remain as unbiased as possible, we decided 
to rely on continuous normalized density distributions 
(reads or tilling array intensities) rather than discrete inter- 
vals to score genomic regions. Normalized density distri- 
butions for ChlP-seq and ChlP-chip data are downloaded 
from the modEncode, BDTNP and Furlong Lab website 
in .wig or .sgr format, and converted to BedGraph format 
(note that this conversion is merely a format conversion, 
and does not correspond to a definition of ChIP peaks); 
each of the 136K regions is intersected with each of these 
continuous density profiles for each iVE using the 
BedTools (43), and the average per base score for each 
region and each profile is computed. Based on this score 
an overall ranking of all regions in the non-coding genome 
is derived. In this way, a collection of 420iVEs was 
compiled (see main text, Supplementary Table SI and 
Supplementary Materials and Methods section). 

Mapping gene signatures and ChIP peak locations to 
the genome partitioning 

For each D. melanogaster gene annotated in FlyBase 
release 5.37, a candidate regulatory region was defined. 
Based on these regulatory regions associated with 
a gene, the corresponding regions of the 136K genome 
partition are obtained using BedTools (43). By default 
this regulatory region is composed of the 5-kb upstream 
region, limited by the nearest upstream gene, the 5'-UTR 
and the first intron. We do not consider coding exons of 
input genes (genes in the co-expressed gene set) to be can- 
didate CRMs, therefore coding exons that reside in these 
putative regulatory regions are removed. Indeed, to our 
knowledge the known enhancers that overlap with coding 
exons are located in neighbouring genes, and not in 
the transcribed gene itself (44). We refer to a particular 
definition of candidate regulatory regions for a gene as a 
demarcation. Multiple other demarcations were 
assembled; including a demarcation defined not only 
based on the 5-kb upstream region, but extended with 
the full transcript of a gene excluding the coding exons. 
In addition, a demarcation that extends the aforemen- 
tioned demarcation by a 5-kb downstream region limited 
by the nearest downstream gene and a demarcation that 
combines a 10-kb upstream region with the full transcript 
and the 10-kb downstream region were also created. 
All these demarcations are available via the web interface 
and the effect on the performance was investigated on a 
benchmark of genesets (Supplementary Figure S2). For all 
the analyses performed in this article, we used demarca- 
tions based on the Release 5.12 (October 2008) FlyBase 
gene annotation. These candidate regulatory regions are 
initially determined for genes annotated with FBgn iden- 
tifiers. Via conversion tables available from FlyBase that 
map between different gene nomenclatures, demarcations 
for genes defined by annotation IDs (i.e. CG numbers) 
and symbols were also derived, enabling the analysis of 
gene signatures supplied in these different gene nomencla- 
tures. To determine the regions of the genome partition 
that correspond to an input set of ChIP peak genomic 
locations, we use BedTools. Only the regions that have 
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a minimal overlap of 40% with a ChIP peak or a 
candidate regulatory region are retained. The overlap 
fraction is defined as the number of nucleotides that 
overlap between region and ChIP peak or regulatory 
region divided by the number of nucleotides in the 
genome partition region. 

Enrichment analysis via cumulative recovery curves 

Enrichment is calculated using cumulative recovery curves 
as described before (28). Briefly, given a set of candidate 
(or 'foreground') regions, corresponding to the regulatory 
regions associated with a set of co-expressed genes or with 
the genomic locations of a collection of ChIP peaks, those 
features are identified for which the top fraction of 
their associated 136K regions ranking is enriched for 
these candidate regions. To this end, the recovery of 
these regions based on the ranking associated with each 
feature in our collection is assessed by calculating the cu- 
mulative recovery of these regions with increasing region 
rank. Of special interest is the early retrieval of foreground 
regions, therefore the area under the curve (AUC) for the 
top ranked regions is used as a metric to quantify the 
enrichment of these regions at the top of a ranking. 
The threshold that defines the 'top' is a parameter for 
the user and is set at 1% by default. 

The distribution of this AUC metric for all features 
provides a method to define exceptionally good 
recovery, as the normalized enrichment score [NES = 
(AUC-AUCmean)/AUCstd)]. The NES is computed 
for each feature, and only features associated with a 
recovery above a certain threshold are considered as 
enriched features. In the online i-cisTarget application, 
the threshold can be chosen by the user, and is set to 
NES > 2.5 by default. In the analysis of ChIP peak data 
sets, the motif signal is usually much stronger than in 
co-expressed gene sets, allowing for a more stringent 
setting of the threshold for ChIP (e.g. NES>4) to 
increase the specificity. Because features compiled from 
different sources may have different AUC distributions, 
we group these features in different databases. This 
allows for database-specific enrichment analysis. Note 
that the calculation of the feature rankings is performed 
only once and reused for multiple recovery analyses on 
different region sets. This effectively reduces the compu- 
tational burden for the calculation of recovery curves, 
making i-cisTarget an on-the-fly analysis tool. 

Validation of motif feature enrichment 

To assess the performance of i-cisTarget several bench- 
marks were created for which the responsible motif 
(or multiple motifs) for each gene signature and ChIP 
peak set in the benchmark is known, i.e. 15 gene signa- 
tures curated from the literature and 40 ChIP datasets 
(see 'Results' and 'Discussion' section). To compensate 
for the redundancy in the large PWM collection used by 
i-cisTarget, the enriched features for each benchmark set 
are clustered via STAMP (see section on PWM library 
and motif clustering in the Supplementary Materials and 
Methods) and the enriched cluster of motifs are ranked 
based on their best ranked motif. The metric used to 



quantify the performance of i-cisTarget is the best, 
i.e. lowest, rank of the motif cluster that contains at 
least one known motif. 

Candidate enhancer prediction 

i-cisTarget not only predicts the features enriched in a set 
of co-regulated genes or ChIP peaks. Using the recovery 
curve for an enriched feature, a list of candidate enhancers 
for that feature is also provided. These candidate enhan- 
cers are defined as a subset of the 'foreground' regions, 
i.e. the fraction of the 136K regions that map to the 
ChIP peaks or the putative regulatory regions associated 
with the set of co-regulated genes. More precisely, the 
maximum deviation of the recovery curve associated 
with an enriched feature from the average recovery over 
the entire feature database plus two standard deviations is 
taken as a threshold on the foreground regions, ranked 
based on the 136K regions ranking linked with that 
particular feature. Additionally, the enhancers that are 
not part of the foreground set but are nonetheless highly 
ranked for an enriched feature can also be retrieved. 
This extends the set of predicted target enhancers 
beyond the initial foreground set. 

Fine-tuning the analysis by combining enriched features 

The ranking-based framework allows for the creation 
of new features based on existing ones. Because every 
available feature is represented as a ranking of 
non-overlapping regions covering the complete 
non-coding genome of Drosophila, they can be combined 
via OS (see section on Scoring with PWMs). From a subset 
of the enriched in vivo and motif features in a i-cisTarget 
analysis new combinations can be created, either as 
pairwise combinations, or by collapsing more than two 
features into a single feature. These newly created 
features can be assigned to any feature database used 
in the initial analysis for assessment of their enrichment. 

Validation of candidate enhancer prediction 

To validate the enhancers predicted by i-cisTarget, we 
calculate the positive predictive value (PPV, also referred 
to as precision) and true positive rate (TPR, also called 
sensitivity or recall) based on a set of known true enhan- 
cers. True enhancers are defined as the subset of the 136K 
regions that overlap with the true binding regions for 
a TF, which are derived from a ChIP based in vivo assay 
or from a database compiling known CRMs such as the 
RedFly database. 

These metrics can be summarized in the F! measure, 
the harmonic mean of PPV and TPR: F[ -measure = 
2*[(TPR*PPV)/(TPR + PPV)]. A perfect prediction, 
i.e. only true enhancers are discovered and also all of 
them, corresponds to a value of 1. If all predicted 
enhancers are unknown, the Fi-measure is 0. 

Prediction of HSF and Mef2-binding sites 

For HSF, we used the binding locations determined in (44) 
consisting of 708 unbound HSF motifs. For MEF2, we 
predicted TFBS using the available MEF2 in TRANSAC 
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(V$MEF2_02, V$MEF2_03) using Matrix-Scan from the 
RSA-tools suite (45). We used a threshold of 9 on the 
TFBS score, and removed all sites that overlap a MEF2 
ChlP-chip peak within 100 bp, resulting in 4557 predicted, 
unbound MEF2 sites. 

Comparison with other tools 

The Linux executable was downloaded and installed 
for CRMDecoder (35); 370 bed files where downloaded 
from (i) the modEncode website and (ii) the BDTNP 
website (DNAsel hypersensitive sites only). 

Availability 

i-cisTarget is available via an easy to use web interface 
(http://med.kuleuven.be/lcb/i-cisTarget), providing access 
to a 'version 1.0' database of 4238 motif features and the 
420iVEs used in this article, and to an updated 'version 
2.0' database with 6383 PWMs and 536iVEs. All types of 
analysis can be performed via this interface, including 
combining enriched features, using the optimal targets of 
an enriched feature as input set for another analysis (serial 
i-cisTarget analysis), and retrieving genome-wide CRM 
predictions. Additionally, a UCSC custom track with 



the predicted motifs and CRMs for enriched PWM 
features can be calculated. 

Supplementary materials and methods 

The Supplementary Materials and Methods contain 
further information about: Collections of motifs and 
iVEs used; PWM library and motif clustering; support 
for both gene signatures and genomic loci; extending less 
abundant feature databases for enrichment analysis; 
Combining gene sets and genomic loci for validation 
purposes; and Analysis of FlyBase TermLink sets. 

RESULTS 

A new genome-wide scoring and enrichment scheme 

Figure 1 shows the components of i-cisTarget. The user 
input to i-cisTarget is a set of gene identifiers, for example 
a set of co-expressed genes, or a set of genomic loci 
(e.g. CMP peaks) in a bed file. The output is a list of 
enriched features, either motifs or iVEs, and for each of 
these features, a list of highly enriched regions for this 
particular feature, representing potential CRMs. The 
input set is mapped to a database of 136353 predefined 
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Figure 1. Flowchart of i-cisTarget. The 136K regions are scored in batch (i.e. offline) with collections of PWMs and iVEs, yielding PWM and iVE 
rankings respectively. An input set of genes or genomic loci is mapped to the 136K set to obtain a set of foreground sequences. The enrichment of 
the foreground sequences is calculated in all rankings using recovery curves and statistics. Top ranking regions for enriched features represent 
candidate CRMs. 
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genomic regions — called the 136K set — that covers the 
entire non-coding genome (see 'Materials and Methods' 
section Partitioning the non coding genome and 
Supplementary Figure SI). The subset of genomic 
regions that overlaps with the input set determines the 
foreground set (see 'Materials and Methods' section 
Mapping gene signatures and Supplementary Figure S3 
for details on how the optimal overlap is determined). 
Next, the enrichment of the foreground set for regulatory 
features is tested, relative to the entire 136K set. To allow 
for complex scoring methods including Hidden Markov 
Models, for cross-species conservation of motif occur- 
rences, and for including thousands of features, we have 
separated the actual scoring and ranking of the 136K set 
from the enrichment calculation (Figure 1). Offline, the 
136K set is scored for clustered and conserved occurrences 
of motifs, using thousands of PWMs and for average peak 
intensities using hundreds of iVE data sets (Table 1, 
Supplementary Table SI and Supplementary Materials 
and Methods). Online, we determine the AUC of the 
cumulative recovery curve of the candidate regions along 
the ranked list of all 136K regions (see 'Materials and 
Methods' section Enrichment via cumulative recovery 
curves), and convert the AUC scores into normalized 
enrichment scores (NES) (see Supplementary Figure S4 
for an example analysis). We observed that different 
feature types can result in different AUC distributions, 
and therefore use as a default setting in i-cisTarget to 
normalize the AUC scores separately for each feature 
collection. This way, the NES scores become comparable, 
and different feature types can be presented in one output 
table (Supplementary Figure S4). 

Identification of regulators and functional regions 
on datasets of genomic loci 

We have applied i-cisTarget to several ChlP-chip and 
ChlP-Seq experiments (Supplementary Table SI). 



Table 1. Available features in i-cisTarget a 

Version 1.0 Version 2.0 

(used in the text) (available online) 



Motifs 3731 PWMs: 

1. JASPAR (74) 

2. TRANSFAC (78) 

3. FlyFactorSurvey (77) 

4. Tiffin (75) 

5. Elemento et at. (76) 

6. Stark et al. (79) 

7. SelexConsensus (75) 

In vivo 420 iVEs: 

events 1. modENCODE (4) 

2. BDTNP (80) 

3. Furlong laboratory 
(46) 

4. chromatin states (4) 



6383 PWMs: 

Updated databases used 
in version 1.0+ 

1. YeTFaSCo (82) 

2. hPDI (81) 



536 iVEs: 

Updated databases used in 
version 1.0, now categorized as: 

1. TF binding 
(109 data sets) 

2. non-TF-binding 
(216 data sets) 

3. histone modifications 
(211 data sets). 



"Complete description of the available features can be found in 
Supplementary Materials and Methods. 



We have first evaluated the ability of i-cisTarget to find 
enrichment for the expected motifs, on a collection of 
ChIP datasets for which the targeted TF is known: 25 
ChlP-seq datasets for 22 TFs from the BDTNP consor- 
tium, 15 ChlP-chip dataset for five TFs from E. Furlong's 
lab (46), and one additional ChlP-seq dataset from 
Guertin et al. (44). These sets are tested for motif 
enrichment using version 1.0 of the PWM library 
(Table 1). 

For the BDTNP dataset, i-cisTarget can identify the 
correct motif in 18 out of the 25 cases (the TF MED is 
chipped at three different stages and HB at two different 
stages), and in llout of 25 cases this motif is among the 
three top motifs (Figure 2A, see section Validation of motif 
feature enrichment in 'Material and Methods' section for 
the details). For the Daughterless dataset, the motif is 
found with a NES of 3.9, which is just below our stringent 
threshold of 4. The more difficult cases are D, DL, GT, 
MED, TLL and BAP, for which the expected motif is not 
found as significantly enriched. De novo motif discovery 
also fails to find these motifs, indicating that these datasets 
the expected motif is indeed not enriched among these 
ChIP peaks (data not shown). For the mesodermal 
dataset, which contains ChlP-chip results for MEF2, 
BIN, BAP, TIN and TWI, the correct motif is among 
the top three scoring motifs in all cases, except for the 
MEF2_4h-6h and the BAP datasets. Finally, we 
confirmed the good performances in motif-enrichment 
on the ChlP-seq dataset of Guertin et al. (44), consisting 
of 422 regions bound by heat-shock factor (HSF) in S2 
cells, in which the HSF motif (M01244-V-HSF2_02, first 
motif) reaches an impressive NES of 26.1. The heatmap 
thus highlights the high sensitivity of our motif enrichment 
approach, as well as its specificity, given that most cells 
are empty in the heatmap, except for some known 
co-factors: Bagpipe and Biniou in the visceral mesoderm 
(47); or Knirps and Hunchback in the blastoderm (19). 
Furthermore, the NES score not only provides a qualita- 
tive indication of possible binding of the corresponding 
TF, but it also provides quantitative information on the 
amount of binding, as can be seen on the example of the 
Zelda motif. The 15 mesodermal datasets have various 
NES scores which seem to be higher for early datasets. 
Using a published ChlP-seq dataset on Zelda (48), we 
show that the amount a actual overlap of Zelda-binding 
events with the 15 mesodermal datasets is highly 
correlated to the NES score of the corresponding 
Zelda motif (Supplementary Figure S5). 

We next turned to iVE enrichment. We used as a first 
test the same HSF dataset mentioned previously (44), and 
a control set of 708 control regions that contain predicted 
HSF-binding sites but with no evidence of HSF binding in 
the ChIP data, taken from (44). Running i-cisTarget 
on both datasets using PWMs and iVEs yields the HSF 
motif as the feature with the highest NES in both datasets 
(respectively 26.1 and 23). This comes as no surprise, given 
that the control set was based on the presence of a high 
affinity TFBS. Therefore, the motif alone cannot distin- 
guish bound from unbound regions. Turning to iVEs, 
19 have a NES > 4 in the bound set, versus 6 in the 
unbound control dataset (Figure 2B). Among the 
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Figure 2. Motif and iVE discovery in sets of genomic loci. (A) Heatmaps displaying the motifs discovered in various ChIP datasets; red indicates 
that the motif ranks among the top three motifs, pink that the motif has an enrichment score above the NES threshold (NES > 4), and black 
indicates that the expected motif is not found. The grey square for da indicates that the DA motif is found with a NES of 3.9, just below our 
stringent threshold of 4. Note that the absence of the dl motif in the BDTNP DL dataset is likely due to an incorrect dataset (see text). (B, C) 
Scatterplot of NES scores for top rankings iVEs in set of bound versus unbound regions for heat-shock factor (B) and MEF2 (C). iVE directly 
related to the condition of the dataset (S2 cell for HSF; embryonic for MEF2) are represented by green triangles. 



enriched features in the positive dataset most are related to 
open and active chromatin: GAF/Trl binding, enrichment 
of DNAsel hypersensitive sites (DHS) in S2 cells, or 
presence of H3K27ac marks in S2 cells. Note that 5 of 
the 19 enriched marks are specific to S2 cells, correspond- 
ing to the cell line used for the ChlP-seq assay. On the 
other hand, the five enriched marks in the unbound set 
correspond to features for early embryonic stages (E0-4h 
caudal) and are not related to HSF binding or typical 
regulatory properties of S2 cells. The only high scoring 
feature in this set is CBP binding at pupal stage, which 
might indicate that some of the predicted TBFS could be 
bona fide binding sites at later stages or in different con- 
ditions. Indeed, 27 of the 708 HSF-unbound regions 
intersect with HSF-binding events in Kcl67 cells (49). 

To verify whether this works for other factors too, we 
have repeated the same analysis for MEF2, a TF primarily 
involved in myogenesis. We used MEF2 ChlP-chip data 
(46) at all timepoints as a positive set, and generated 
unbound regions by selecting the 4500 top-scoring 
MEF2-binding sites obtained using the Matrix-Scan 
method from RSAT (50) with the TRANSFAC MEF2 
matrix. Note that the positive and control sets were size 
matched. Performing the analysis on iVEs yields analo- 
gous results as seen for the HSF case. Namely, features 
related to open/active chromatin (e.g. DHS, GAF binding 
and CBP binding) have very high enrichment scores in the 
positive set (Figure 2C). On the other hand, the unbound 
set shows much fewer enriched features, mostly related to 



repressive marks, like insulators [e.g. Su(HW), mod2.2] or 
heterochromatin (HP2), corroborating that these MEF2 
sites are not bound in vivo. In both cases (HSF and 
MEF2), we verified that sets of random regions of 
similar sizes do not show any enriched features with an 
NES > 4. In conclusion, relevant motifs and iVEs can be 
identified from ChIP peak data sets using i-cisTarget. 

Identification of regulators and functional CRMs on 
co-expressed gene sets 

We constructed a benchmark dataset of co-expressed gene 
signatures obtained by microarray experiments [some 
were described in (28) and we have added several more 
data sets, see Supplementary Table SI]. The gene signa- 
tures were chosen in such as way that they are likely 
enriched for direct target genes of a particular TF, either 
because a gain-of-function or loss-of-function experiment 
for that TF was performed, or because the gene-expres- 
sion data were obtained in purified cells, such as the 
proneural cluster (PNC) dataset (51), where the master 
regulators are known [Su(H) and Achaete/Scute for 
neuronal specification in the PNC (51)]. Across the bench- 
mark we successfully identified the correct motif for 10 out 
of the 12 expected TFs (Figure 3A; Supplementary 
Note SI discusses the two failures, marked as black 
squares in Figure 3A). 

Our new approach of mapping a set of co-expressed 
genes to a set of predefined genomic regions (see 
'Materials and Methods' section) performs equally well, 
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if not better, compared to 'gene-based' approaches 
(28,30,31) (Supplementary Figure S6). However, the 
region-based approach now also allows identifying 
enriched iVEs on gene sets, besides PWMs. We ran 
i-cisTarget on the Zelda LOF and the PNC gene sets 
from the benchmark, this time using both the 3731 
PWMs and 420 iVE features. Remarkably, in both 
cases, by normalizing each type of regulatory features 
separately, the final ranking of regulatory features 
contains a mixture of PWM and iVEs among the top 
ranked features. On the Zelda LOF dataset, the top two 
motif clusters are the Zelda/VFL motif (best representa- 
tive motif: elemento-CAGGTAG; NES = 15.5) and the 
BCD motif (best representative: selexconsensus-oc, 
NES = 5.17); indeed, the fact that an independent 
bicoid-related iVE is found highly enriched (BDTNP- 
BCD; NES = 4.36) confirms that the motif cluster is 
likely representing BCD, rather than OC or GSC. 
Moreover, several enriched iVEs are pointing at Caudal 
(modEncode-MAT_GFP_7T-E-0-4h, BDTNP-cad) as 
an important co-factor of Zelda, while another enriched 
iVE from a different datasource is early Twist 
(Furlong-TWl-2-4h; NES = 5.10). Interestingly, these 
TFs have recently been shown to be key players in the 
maternal to zygotic activation, together with Zelda and 
STAT92E, which is also found among the enriched iVEs 
(modEncode-MAT_Stat92E_E0-12h; NES = 3.01) (52). 
Finally, several features related to H3K27me3, a 
polycomb-related repressive mark, are found enriched, in 
accordance with the tight early transcriptional control 
through PcG complexes. 

On the PNC data set the enriched PWMs represent 
the characteristic TFs involved in PNCs [Su(H) and 



Achaete/Scute] and several other relevant TFs (Pointed 
and Grainyhead) (Supplementary Figure S7; the entire 
i-cisTarget results are available from our website). 
Among the most enriched iVEs from BDTNP on the 
PNC set are BDTNP-da (NES = 6.74), derived from 
ChIP against the proneural partner Daughterless (DA) 
and BDTNP-MED (NES = 4.95), derived from ChIP 
against Medea, which is an effector of the dpp-signalling 
pathway. This is an interesting finding in the light of the 
recent discovery that SMAD proteins, the vertebrate 
homologues of Medea, co-operatively bind to CRMs 
with cell-type specific master regulators, such as Myodl 
in myotubes, Oct4 in ESCs and PU.l in pro-B cells (53). 
Our finding of DA-ChIP and MED-ChIP co-operation in 
proneural clusters cells suggests that Smad/Medea 
co-operativity with a master TF may be a conserved 
phenomenon. Finally, several iVEs from modENCODE 
are among the top features, including the H3K27me3 
ChIP data and POLII binding. Although polll binding 
is historically linked with proximal promoters, several 
recent lines of evidence indicate it can be generally 
associated with CRMs, even with distal CRMs (16,54,55) 
To confirm that i-cisTarget identifies meaningful iVEs 
from gene sets, we selected cases for which both gene- 
expression data and sets of genomic loci are available, 
and compared the enrichment scores of features on both 
types of datasets. These are the Zelda and MEF2 cases, 
having a LOF gene set (56,57), as well as a ChIP datasets 
of binding locations (46,48,57) This is an interesting 
configuration, which allows us to tackle the question of 
regulatory feature enrichment from two independent 
perspectives, in particular for iVEs; indeed, if the 
enriched iVEs identified in gene sets are truly related to 
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CRMs regulating these genes, then the same iVEs should 
also be enriched in the corresponding ChIP datasets or 
CRMs. If not, then the identified iVEs might rather be 
related to specific regions of the 136K regions overlapping 
the genes (e.g. promoters, TSS, introns) other than 
enhancer regions. The result of the comparative 
i-cisTarget analysis (see 'Material and Methods' section 
Combining gene sets and genomic loci for validation 
purposes) is striking (Figure 3B-C): apart from a few 
exceptions, the majority of enriched iVEs detected in the 
gene sets are also enriched in the corresponding CRM set. 
This is particularly true for iVEs that are directly related 
to the biological context investigated (embryonic features 
for the Mef2 case, S2-related features for the Zelda case). 
This indicates that, starting from a set of co-expressed 
genes, i-cisTarget is able to highlight the iVEs that 
are related to their co-regulation by common TFs. This 
confirms the validity and robustness of our feature enrich- 
ment approach in detecting specific features (motifs and 



iVEs) from a gene set that are relevant for the actual 
CRMs regulating these genes. 

i-cis Target accurately predicts candidate CRMs 
from gene sets 

Each enriched feature selects a subset of highly ranked 
regions from the input set. We reasoned that these are 
candidate CRMs regulating the set of input genes, and 
verified this on the two test cases presented above; we 
first used the Zelda ChlP-seq mentioned previously as 
an independent validation for the CRM predictions on 
the Zelda LOF gene set. The optimal threshold 
determined by i-cisTarget leads to 72 direct target CRM 
predictions for the CAGGTA motif, of which 62 (86%) 
intersect a Zelda ChIP peak (Figure 4A). This represents 
a 3.6-fold increase of the precision rate (i.e. positive 
predictive value or PPV) compared to the input (198 of 
the 831 input regions, or 24%, overlap with a Zelda ChIP 
peak) and clearly illustrates the high precision of the target 
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CRM predictions based on predefined genomic regions. 
We performed this analysis also on enriched iVEs, and 
on combinations of PWMs and iVEs (see 'Materials and 
Methods' section Fine-tuning the analysis by combining 
enriched features). i-cisTarget is based on ranking statistics 
and therefore allows to combine features from the same or 
across databases using OS, without the need to re-score 
a genomic sequence and to re-normalize scores across 
different feature types (41). We assessed whether the 
combination of motifs with iVEs can increase the PPV 
on CRM detection. Indeed, several of these combinations 
lead to significant increases in the precision and/or recall 
of the prediction (Figure 4A), as well as the Fl-score 
(Figure 4B; see 'Materials and Methods' section 
Validation of candidate enhancer predictions): BDTNP- 
CAD combined with the CAGGTAG motif reaches 
a PPV of 98% (87/89) and a TPR of 43% (87/198), thus 
outperforming each of the individual features. The highest 
recall (54%) is obtained for the combination of 
BDTNP-TLL with CAGGTAG. 

In the analysis of the PNC gene set, we have no 
ChIP data for Su(H) or the proneural factor to indicate 
the true CRMs. Here, we used all annotated CRMs 
from RedFly as true CRMs, in total 82 CRMs 
overlapping with 85 regions (min. 40% overlap), of 
which many are active in the PNC (e.g. enhancers in the 
E(spl) complex or the Achaete-Scute complex). Here 
we assess whether, starting from the co-expressed genes, 
the true CRMs can be identified, by following the 
highly ranked genomic regions for a particular enriched 
feature. 

Indeed, we observe a significant enrichment in true 
positive CRMs in the top ranking regions of enriched 
features. The best individual feature is BDTNP-DA, 
corresponding to the best NES score in i-cisTarget 
(PPV = 22.4%, as compared to the 85 positive regions 
versus the 1476 input regions, corresponding to 6%, 
Figure 4C). Note that BDTNP-MED shows a similar 
result, corroborating our hypothesis that the DA and 
MED TFs co-operatively bind to proneural enhancers. 
The low PPV values are due to the small number of 
validated CRMs in RedFly, and is likely largely 
underestimated. Interestingly, we found that again the 
combination of two features, particularly a PWM and 
an iVE, outperforms individual features in terms of 
precision/recall. The best feature combinations, in terms 
of the Fl score increase compared to the input, are the 
SU(H) PWM combined with either DA or MED ChIP 
data (Figure 4D). Four of the target predictions for 
[E-box, BDTNP-DA] are shown as example in 
Supplementary Figure S7 as UCSC Genome Browser 
images. 

In conclusion, the analysis on co-expressed genes 
results in meaningful regulatory features, both PWM 
features and iVEs. Using any of these top-enriched 
features, or combinations thereof, leads to the prediction 
of direct target genes (a subset of the set of input genes) 
and target CRMs, operating in the gene-regulatory 
network (GRN) underlying the biological process under 
study. 



i-cisTarget allows fast analysis of hundreds of data sets 
and predicts tissue-specific gene regulatory networks 

i-cisTarget can analyse a set of genes, or a set of genomic 
loci (even thousands of regions) in a few seconds. We have 
therefore implemented a batch analysis function in the 
online i-cisTarget application, allowing to perform one 
analysis with hundreds of data sets (using the GMT file 
format, see Supplementary Materials and Methods). To 
illustrate the potential of this feature, we analysed a large 
compendium of 628 sets of genes that are co-expressed in 
the same cell type or anatomical structure in the fly (based 
on immunohistochemistry or in situ hybridization) (58). 
Despite the sparse knowledge of validated tissue-specific 
co-localization, we identified in 290 sets an enriched motif 
for at least one TF that is co-localized in the same tissue. 
Moreover, in 188 sets, the number of tissue-specific TFs 
for which the motif is enriched, was significantly higher 
than expected by chance (Data not shown). This means 
that for these 188 sets the predicted TF-target relations are 
of high confidence and can be used to draw regulatory 
subnetworks that connect the co-expressed genes. As 
an example, we present the i-cisTarget analyses for 
four of these sets, namely 'adult mushroom body' 
(FBbt:00003684), 'Kenyon cells' (Fbbt:00003686; a child 
term of adult mushroom body), 'Pericardial cells' 
(FBbt:00005058) and 'Cardioblast' (FBbt:00001666). 
First, on a set of 48 genes expressed in the mushroom 
body and a set of 15 genes expressed in the encompassed 
Kenyon cells, we identified the motifs of three TFs that 
have a known role in these cells, namely Eyeless, MEF2 
and ECR (59,60,83). Hence, of the six TFs annotated as 
expressed in these cells, three are predicted as master regu- 
lators by i-cisTarget. This leads to several new interesting, 
high-confidence target gene predictions from which we 
derived a predicted GRN (Figure 5A). These predictions 
provide new insight into the role of the master regulator in 
the Drosophila brain, Eyeless, for which we predict 
multiple target genes like fru, Fas2, hh, Appl, til and 
Mef2. The prediction that EY could drive the hedgehog 
pathway during development of the mushroom body by 
directly controlling hh at the transcriptional level bares 
similarities to eyeless being upstream of the hedgehog 
pathway during retinal determination, where the 
hedgehog pathway drives the movement of the morpho- 
genetic furrow, downstream of eyeless (61). It is remark- 
able that the few TFs annotated with these FBbt terms are 
so highly interconnected. The same is true for the TFs 
involved in heart development, as shown in the networks 
drawn for pericardial cells and cardioblast (Figure 5B-C). 
These two networks show similarity and differences 
between cardioblasts and pericardial cells in terms of 
TF-target interactions, tinman and Mef2 are expressed in 
both cell types, and the IVEs and/or motifs for both TFs 
are found for both sets, while the other TFs are specific for 
one cell type (hth, Doc2 and Antp are expressed only in 
cardioblasts where their motifs are found enriched). 
Surprisingly, in the heart cases only the corresponding 
iVE for MEF2 (MEF2 ChIP data) is found enriched, 
while in the mushroom body data set both the MEF2 
motif and the iVE were found. It is possible that 
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MEF2-binding sites in cardiac cells are not well repre- 
sented by the available MEF2 PWMs in our library. 
Among the predicted target genes, some are already 
known [e.g. doc and Mef2 are known TIN targets in the 
cardioblast network (62)], some are present in the DrolD 
interaction database (plain or arrowed edges) and some 
are new predictions (dashed edges). In conclusion, 
meaningful gene regulatory networks can be drawn 
from i-cisTarget analyses on FBbt and other sets of 
co-expressed genes, yielding new hypotheses on direct 
regulatory interactions. 

Comparison with previous methods 

In Table 2, we have selected a number of previously pub- 
lished tools, representing a broad spectrum of approaches 
in the domain of c/s-regulatory analysis. While we have 
tried to make this selection as unbiased as possible, it is 
by far not exhaustive, and some tools are representatives 
of a whole class of comparable tools (in their scope, but 
not necessarily in their performance). For example, 



peak-motif was chosen as a representative tool for many 
motif-discovery tools in ChIP datasets. Several important 
distinctions can be made between these tools; for example, 
some tools are designed for feature extraction from a set 
of regions or genes [Clover (6), peak-motifs (13), 
Cistrome/SeqPos (63)], while other tools focus on CRM 
predictions using a learning approach based on mixture 
models [CENTIPEDE (32), Chromia (33)] or a statistical 
criterion based on local motif clustering [Cluster-Buster 
(20)]. Another distinction that can be made is the type 
of features considered. Some tools are only based on 
motif discovery or enrichment, while more recent ones 
take advantage of chromatin features like histone modifi- 
cation or chromatin structure. We have shown that 
i-cisTarget combines these different aspects: it includes 
both motifs and iVEs, extracts enriched features in a 
given dataset and predicts potential CRMs based on 
these identified features. In our opinion, the closest 
matching tool is CRMDecoder (35), which also provides 
feature extraction and CRM prediction abilities, and can 
incorporate any type of chromatin feature or annotations 
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in bed format, provided by the user. Two noticeable 
differences with i-cistarget are: (i) CRMDecoder does 
not include TF motifs for CRM predictions; and (ii) 
CRMDecoder only analyses sets of genomic loci and not 
co-expressed gene sets. We thus restricted the comparison 
of both tools to genomic loci, using iVEs extracted from 
modEncode, including DHS data from the BDTNP 
project. Since CRMDecoder does not take motifs into 
account, we applied it to the HSF and MEF2 datasets 
described previously consisting of truly bound regions, 
pooled with unbound regions containing the motif, 
asking to what extend these two classes of regions could 
be discriminated by iVEs. 

On the MEF2 dataset, which consists of 8009 bound and 
7045 unbound regions (called 'training set'), CRMDecoder 
selects 210 significant features, and identifies 3622 CRMs 
of which 1933 intersect with 6761 regions in the training 
set; 5306 of these 6761 regions are MEF2-bound regions 
(PPV = 79%). i-cisTarget on the other hand selects 19 sig- 
nificantly enriched iVEs with NES score > 4. The top 
ranking iVE (BDTNP-DHS-stgl 1) selects 1841 regions 
that intersect 3313 regions of the training set, and 3119 
of these are MEF2-bound regions (PPV = 94%). 

On the HSF dataset consisting of 422 bound and 708 
unbound regions, CRMDecoder identified DHS in Kcl67 
and S2 cells and Adult male and female CBP as the top 
ranking features (highest CCD score) is a list of 76 signifi- 
cant features (out of 369 features). Using these significant 
features, CRMDecoder predicted 2553 CRMs of which 
324 overlap 388 of the training regions. Of these 388 
regions, 307 are bound by HSF in vivo, hence a PPV of 
79%. On the same dataset, i-cistarget identified 17 iVE 
with a NES score >4, among which Adult and Pupae 
CBP (rank 1,2,3), Kcl67 and S2 DHS, as well as Polll. 
The top ranking feature (AdultMale CBP) selects 238 high 
ranking regions, which overlap 238 of the training regions, 
with 201 of these being bound, thus a PPV of 84%. While 
the PPVs of both tools are comparable in the latter case, it 
is important to note that i-cisTarget achieves this result 
using one single feature. Taking advantage of the ability of 
our tool to easily combine features, we can combine 
AdultMale_CBP with hypersensitive regions in S2 cells; 
this combination now achieves a PPV of 90%, hence a 
significant increase over the single feature. This compara- 
tive study shows that i-cisTarget is the only method that 
combines iVE discovery, motif discovery and CRM 
prediction and that i-cisTarget outperforms existing 
tools when overlapping functionalities are compared. 



DISCUSSION 

The last 15 years many bioinformatics methods and tools 
have been developed for cw-regulatory sequence analysis 
(64). Broadly, they can be divided in two categories. The 
first category is methods for motif discovery on a set of 
co-regulated sequences, such as MEME-like approaches 
(dozens of methods and extensions exist). The second 
category are methods for CRM prediction through 
whole-genome scanning using one or more known 
motifs as input, often using Hidden Markov Models and 



sequence conservation cues [see (65) for a review]. A few 
methods, such as phylCRM/Lever, ModuleMiner and 
cisTargetX combine both approaches and show increased 
motif discovery performance, even when very large 
upstream regions and introns are included in the 
analysis (28,30,31). The concept of these integrative 
methods is to apply genome-wide CRM scoring, including 
comparative genomics cues, for many different models 
(e.g. PWMs), followed by the identification of those 
particular models that yield the highest accuracy on a 
set of co-expressed genes. In this work we have introduced 
three important novelties into a new method, called 
i-cisTarget. The first is the a priori determination of 
136K regions to be scored, which leads to an increased 
flexibility. In particular, this partitioning of the genome 
allows to analyse both data sets of genomic loci (by select- 
ing all 136K regions that overlap these loci) and 
co-expressed gene sets (by selecting all 136K regions that 
fall in the upstream and intronic space of all genes in the 
set). In this study we obtained good results for a genome 
segmentation using sequence conservation (phastCons) 
combined with insulator sites, and excluding coding 
exons. However, we envision that improvements can be 
made on the genome segmentation, for example by 
including coding exons (66) or using a segmentation that 
is guided by the high-throughput data sets (i.e. the iVEs) 
themselves. The latter can become practical as more and 
more data sets are generated with overlapping results, 
which may ultimately converge to a defined set of regula- 
tory regions. The second novelty is the generalization of 
regulatory feature discovery, with the possibility to 
identify enriched motifs (as PWMs) but also enriched 
iVEs such as ChlP-peaks, and active/repressive chromatin 
marks. The third novelty is the ability to perform any 
combination of regulatory features, even across different 
types of features (e.g. a motif with a ChIP or DHS 
feature). 

Taken together, these features allow analysing most 
kinds of high-throughput data available in Drosophila, 
and to combine several analyses using the same tool for 
different datasets. For example, it is possible to combine 
the analysis of binding location data for a particular factor 
(ChIP) with the analysis of the corresponding expression 
data in mutant conditions for this factor, as we have 
shown for MEF2 (57) and Zelda (48,56). 

We have applied our tools on various datasets, distin- 
guishing gene sets from sets of genomic loci. For gene sets, 
we have shown that i-cisTarget identifies the enrichment of 
the correct motif in most gene sets we investigated; failures 
to do so might be explained by the specificity of the 
binding motif to certain conditions or tissues. Enriched 
iVEs can lead to interesting new hypotheses, such as the 
co-operation between daughterless and Medea, inferred 
from the PNC set analysis, that resembles the recent 
discovery of Smad co-operation with master regulators 
(53); or the prediction of new TF-target and TF-TF 
interactions across cell types in Drosophila, as was 
demonstrated for Kenyon cells, pericardial cells and 
cardioblasts (Figure 5). Moreover, the discovered motifs 
lead to CRM predictions in the 5kb + 5'-UTR + first 
intron of the input genes that have a high specificity to 
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be regulatory regions, as was demonstrated on the zelda 
LOF dataset (56) and the PNC dataset (51). A current 
limitation of i-cisTarget, when analysing gene sets, is the 
arbitrary assignment of genomic regions to the gene set. 
Multiple demarcations are available at the i-cisTarget web 
tool, for example [5-kb upstream limited to upstream 
gene, 5'-UTR, and first intron] or [10-kb upstream 
limited, 5'-UTR, all introns, 3'-UTR and 10-kb down- 
stream limited to downstream gene] (see 'Materials and 
Methods' section). A future challenge remains identifying 
very distal enhancers and enhancers overlapping the 
coding sequence of nearby genes (66). A simple extension 
of the sequence search space, including more sequence and 
including intronic and exonic sequences from 
neighbouring genes, will not solve the problem. Indeed, 
when applying i-cisTarget to 100-kb upstream and down- 
stream sequence of the TSS (this search space includes 
100% of REDfly CRMs), without truncating this 
sequence at neighbouring genes, the performance drops 
dramatically (see Supplementary Figure S2). 

We also used several ChIP datasets to investigate the 
performance of i-cisTarget on sets of genomic loci. Here, 
as for the gene sets, i-cisTarget performs very well in 
recovering the expected motif from a comprehensive 
library of motifs, but also highlights the involvements of 
other factors, such as Zelda or Trl in embryonic datasets. 
While motif discovery or enrichment is also performed by 
several other tools (45,67), i-cisTarget adds the possibility 
to search for additional iVEs. We have shown that a 
TF-binding site (TFBS) does not necessarily correspond 
to a binding event. While potential binding sites for HSF 
or MEF2 cannot be distinguished from actual binding 
events based on motif enrichment alone, adding iVEs 
clearly selects marks typical for active chromatin as the 
best discriminant between actually bound or unbound 
sites. We emphasize that this result is obtained ab initio, 
without any prior knowledge of which are the relevant 
iVEs. Hence, additional signals are needed for a TF to 
bind to a motif sequence, and these are often related to 
marks of open or active chromatin: DNAse hypersensitive 
sites, binding of pioneering factors such as Trl or Zelda, 
whose role as a general precursor of chromatin opening 
has only very recently been hypothesized (48). 
Interestingly, while in both HSF and Mef2 cases, the 
bound motifs present an enrichment for active features 
(GAF/Trl, CBP/p300, or DHS), the pattern of enriched 
features for unbound motifs is quite different. Namely, 
the unbound MEF2 motifs present an enrichment for 
repressive chromatin marks [Su(HW) or heterochromatin 
like features], while the unbound HSF motifs do not 
present any of these marks, consistent with what was 
reported in Guertin et al. (44). This might suggest a 
distinct mechanism of negative regulation through chro- 
matin conformation between developmental processes 
and stress response pathways. 

A feature of our approach that is not found in alterna- 
tive studies is the ability to easily combine any number of 
features to investigate the synergistic effect of different 
features. Being based on ranks, using OS allows an 
'on-the-fly' re-ranking of the 136K regions using particu- 
lar combinations. We showed on the PNC and zelda gene 



set that combinations of PWM and iVE yield higher 
1%-AUCs meaning a much higher specificity in the high 
ranking regions (Figure 4). This last result shows that 
transcriptional regulation is not a linear process, in the 
sense that the contributions of the combination of regula- 
tory features is more than the addition of individual 
contributions, revealing a synergistic mechanism of 
action. Moreover, the fact that many different regulatory 
features are found enriched in the datasets we have studied 
previously confirms that transcriptional regulation is 
intrinsically a highly combinatorial process. 

These two aspects (combinations and synergy) have 
already been extensively described before in the context 
of the enhanceosome model of regulation (68,69). In par- 
ticular, in Drosophila, analysis of a collection of curated 
CRMs showed that they are typically characterized by 
a combination of different TFBSs (70,71). This heterotypic 
model has been shown to be the general rule, while 
homotypic CRMs are generally restricted to early 
embryogenesis (71). 

However, these descriptions focused on combinatorial 
regulation by TFs alone. Here, we have confirmed recent 
evidence that this combinatorial regulation extends to 
other kinds of regulatory features such as histone modifi- 
cations, binding of chromatin-modifying proteins or tran- 
scriptional co-factors such as CBP. Hence, we propose 
that the notion of heterotypic model of regulation should 
be extended to describe any combination of regulatory 
features, including motifs and chromatin-related 
features. Similarly to the CRM finding procedure consist- 
ing of finding clusters of TFBS for different TFs (26), 
we introduce and show that searching for 'clusters' of 
regulatory features can improve the predictive power of 
regulatory sequence analysis. 

While our method currently applies to Drosophila, it can 
in principle be extended to any other organism for which 
large-scale collections of in vivo datasets are available, and 
in particular to human. The much greater size of 
non-coding regions in human, and the lower proportion 
of functional DNA in the human genome (72), would 
however require to pre-select candidate regulatory 
regions, as using a full partition of the complete 
non-coding genome would become computationally 
untractable and would contain too high noise levels. We 
are currently working on implementing i-cisTarget for 
human, using the collection of ENCODE datasets. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Online: 
Supplementary Table 1, Supplementary Figures 1-7, 
Supplementary Materials and Methods and 
Supplementary Note 1. 
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